Ceci n’est pas une %>%
Exploring Your Data with R


Michael Clark
Statistician Lead

Preliminaries

Link to slides:
https://m-clark.github.io/workshops/dplyr/mainSlides.html

  • Start RStudio
  • Install the following packages: dplyr, tidyverse, magrittr, plotly

You will get a pop-under window on the lab machines!

  • Affirm that you want to use a personal library.

As we go along, you can install other packages if you want to run the demos (not necessary)

  • magrittr, plyr, tidyr, plotly, rvest, stringr, leaflet

The Reference

The Treachery of Images by René Magritte

Goals

  • Introduce newer approaches to data wrangling, scrubbing, manipulation, etc.
  • Show the benefits of piping code
  • Put it all together with some newer visualization packages

Outline

Newer approaches to data wrangling

  • Introduction to plyr, dplyr, and tidyr (the tidyverse)
  • Subsetting rows
  • Subsetting columns
  • Reshaping data
  • Generating new data
  • Grouping and summarizing

Nothin’s gonna stop the flow

  • More piping with the magrittr package

Quick interactive visualizations

  • plotly, htmlwidgets

Newer approaches to data wrangling

Newer approaches to data wrangling

Let’s say we want to select the following from our data:

  • Start with the ID variable
  • The variables X1:X10, which are not all contiguous and are not the only X* variables
  • The variables var1 and var2, which are the only var variables in the data
  • Any variable that starts with XYZ

How might we go about this?

Some base R approaches

Typically tedious

Multiple steps just to get the columns you want

# numeric indexes; not conducive to readability or reproducibility
newData = oldData[,c(1,2,3,4, etc.)]

# explicitly by name; fine if only a handful; not pretty
newData = oldData[,c('ID','X1', 'X2', etc.)]

# two step with grep; regex difficult to read/understand
cols = c('ID', paste0('X', 1:10), 'var1', 'var2', grep('^XYZ', colnames(oldData), value=T))
newData = oldData[,cols]

# or via subset
newData = subset(oldData, select = cols)

More

What if you also want observations where Z is Yes, Q is No ordered by var1 (descending)… and only the last 50 of those results?

# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No',]
newData = tail(newData, 50)
newData = newData[order(newData$var1, decreasing=T),]

And this is for fairly straightforward operations.

An alternative

newData = oldData %>% 
  filter(Z == 'Yes', Q == 'No') %>% 
  select(ID, num_range('X', 1:10), contains('var'), starts_with('XYZ')) %>% 
  arrange(desc(var1)) %>% 
  tail(50)

An alternative

Piping is an alternative

You can do all this sort of stuff with base R

  • with, within, subset, $, etc.
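For comparison, here is a minimal base-R sketch of a filter-select-sort sequence in that style. Since oldData above is hypothetical, this uses the built-in mtcars data; the particular columns chosen are just for illustration.

```r
# base R filter-select-sort, no pipes: each step overwrites or creates an object
newData = subset(mtcars, am == 1 & cyl == 4, select = c(mpg, wt, hp))
newData = newData[order(newData$mpg, decreasing = TRUE), ]
head(newData, 3)
```

This works, but every step requires naming (and re-naming) an intermediate object.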

While the base R approach can be concise, it is potentially:

  • noisier
  • less legible
  • less amenable to additional data changes
  • more reliant on esoteric knowledge (e.g. regular expressions)
  • more likely to require new objects (even if we just want to explore)

Piping

%>% : Passes the prior object to the function after the pipe

  • x %>% f same as f(x)
  • i.e. object %>% function is equivalent to function(object)
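A minimal sketch of the equivalence, assuming magrittr is loaded:

```r
library(magrittr)

x = c(2, 4, 6)

# x %>% f is just another way of writing f(x)
x %>% mean                      # same as mean(x), i.e. 4

# additional arguments come after the data: x %>% f(y) is f(x, y)
x %>% paste(collapse = ' + ')   # "2 + 4 + 6"
```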

Piping

Another example…

Start with a string, end with a map

wikiURL = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population'

# Let's go!
wikiURL %>% 
  read_html() %>%                                                                   # parse the html
  html_node(css='.wikitable.sortable') %>%                                          # grab a class of object
  html_table() %>%                                                                  # convert table to data.frame
  sapply(function(x) repair_encoding(as.character(x), 'UTF-8')) %>%                 # repair encoding; makes a matrix
  data.frame() %>%                                                                  # back to df
  mutate(City = str_replace(City, '\\[(.*?)\\]', ''),                               # remove footnotes
         latlon = sapply(str_split(Location, '/'), last),                           # split up location (3 parts)
         latlon = str_extract_all(latlon, '[-|[0-9]]+\\.[0-9]+'),                   # grab any that start with - or number
         lat = sapply(latlon, first),                                               # grab latitudes
         lon = sapply(latlon, nth, 2),                                              # grab longitude
         population2016 = as.numeric(str_replace_all(X2016.estimate, ',', '')),     # remove commas from numbers (why do people do this?)
         population2010 = as.numeric(str_replace_all(X2010.Census, ',', '')),       # same for 2010
         popDiff  = round(population2016/population2010 - 1, 2)*100) %>%            # create percentage difference

Cont’d.

  select(-latlon, -Location) %>%                                                    # remove stuff we wouldn't ever use
  filter(as.numeric(as.character(X2016.rank)) <= 50)  %>%                           # top 50
  plot_geo(locationmode = 'USA-states', sizes = c(1, 250)) %>%                      # map with plotly
  add_markers(x = ~lon, y = ~lat, size = ~abs(popDiff), hoverinfo='text', 
              color=~popDiff, hoverinfo='text', colors='RdBu',
              text=~hovertext, marker = list(opacity = 0.5)) %>%  
  layout(title = 'Largest US cities and their 2010-2016 change', 
         geo = g, paper_bgcolor='#fdf6e3')

# g is a list of map options, e.g. projection line colors etc.

And the result…

For your own code, try a more concise approach

  • Better to keep the page parsing, data processing, and plotting as separate steps

However, it serves as an illustration of what’s possible

  • Key commands: mutate, select, filter
  • Autocomplete for unquoted variable names
  • Pipe to other functions (e.g. visualization)

Newer approaches to data wrangling

Packages have been created to make data wrangling easier

We will focus on plyr, dplyr, and tidyr

But others, e.g. data.table, may be useful as well
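As a point of comparison (data.table is not covered further here), a minimal sketch of a filter-and-summarize in its syntax, assuming the package is installed:

```r
library(data.table)

# data.table does filtering, grouping, and summarizing inside [ ]
dt = as.data.table(mtcars)
dt[am == 1, .(meanMPG = mean(mpg)), by = cyl]   # one row per cylinder count
```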

Newer visualization packages work similarly

  • Makes it easier to explore your data visually

A provocation

c('Ceci', "n'est", 'pas', 'une', 'pipe!') %>%
{
  .. <-  . %>%
    if (length(.) == 1)  .
    else paste(.[1], '%>%', ..(.[-1]))
  ..(.)
} 

[1] "Ceci %>% n'est %>% pas %>% une %>% pipe!"

Your turn…

Your turn

Let’s get to it!

  • Use a base R dataset
  • Examples: iris, mtcars, faithful, or state.x77; library(help='datasets')
  • Pipe to something like summary, plot, or cor (if all numeric) as follows:
data %>% 
  function
  • Use additional arguments if you want:
data %>% 
  function(arg='blah')

Note that Ctrl+Shift+M is the shortcut to insert the %>% pipe.

Example

iris %>% summary
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  

Data Wrangling

Generating New Data

Generating New Data

We often want to calculate new variables

Or perhaps make changes to ones we have

We can use mutate or transmute for this

  • mutate appends to current data
  • mutate_all will apply a function to all columns
  • mutate_if will apply under some condition
  • mutate_at will apply to specific columns
  • transmute will return only the newly created data
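A small sketch of these variants on the built-in iris data, before we apply them to scraped data below:

```r
library(dplyr)

# mutate appends new columns to the existing data
d1 = iris %>% mutate(Petal.Ratio = Petal.Length / Petal.Width)

# mutate_if applies a function to columns meeting a condition
d2 = iris %>% mutate_if(is.numeric, round)

# transmute returns only the newly created columns
d3 = iris %>% transmute(Petal.Ratio = Petal.Length / Petal.Width)
```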

First, let’s scrape some data:

url = "http://www.basketball-reference.com/leagues/NBA_2015_totals.html"
bball = read_html(url) %>% 
  html_nodes("#totals_stats") %>% 
  html_table %>% 
  data.frame %>%
  filter(Rk != "Rk")

Generating New Data

The data is currently all character strings

glimpse(bball[,1:5])
Observations: 651
Variables: 5
$ Rk     <chr> "1", "2", "3", "4", "5", "5", "5", "6", "7", "8", "9", "10", "11", "12", "13", "13", "13", "14", "15...
$ Player <chr> "Quincy Acy", "Jordan Adams", "Steven Adams", "Jeff Adrien", "Arron Afflalo", "Arron Afflalo", "Arro...
$ Pos    <chr> "PF", "SG", "C", "PF", "SG", "SG", "SG", "C", "PF", "C", "PF", "C", "SG", "SF", "PF", "PF", "PF", "C...
$ Age    <chr> "24", "20", "21", "28", "29", "29", "29", "26", "23", "26", "29", "25", "33", "24", "32", "32", "32"...
$ Tm     <chr> "NYK", "MEM", "OKC", "MIN", "TOT", "DEN", "POR", "NOP", "PHI", "NYK", "POR", "IND", "MEM", "DAL", "T...

Generating New Data

We’ll use mutate_at to make appropriate columns numeric

  • Not Player, Pos, and Tm.
bball = bball %>% 
  mutate_at(vars(-Player, -Pos, -Tm), funs(as.numeric))   

glimpse(bball[,1:7])
Observations: 651
Variables: 7
$ Rk     <dbl> 1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,...
$ Player <chr> "Quincy Acy", "Jordan Adams", "Steven Adams", "Jeff Adrien", "Arron Afflalo", "Arron Afflalo", "Arro...
$ Pos    <chr> "PF", "SG", "C", "PF", "SG", "SG", "SG", "C", "PF", "C", "PF", "C", "SG", "SF", "PF", "PF", "PF", "C...
$ Age    <dbl> 24, 20, 21, 28, 29, 29, 29, 26, 23, 26, 29, 25, 33, 24, 32, 32, 32, 36, 32, 21, 26, 20, 30, 32, 32, ...
$ Tm     <chr> "NYK", "MEM", "OKC", "MIN", "TOT", "DEN", "POR", "NOP", "PHI", "NYK", "POR", "IND", "MEM", "DAL", "T...
$ G      <dbl> 68, 30, 70, 17, 78, 53, 25, 68, 41, 61, 71, 63, 63, 74, 53, 12, 41, 60, 74, 33, 61, 81, 40, 49, 63, ...
$ GS     <dbl> 22, 0, 67, 0, 72, 53, 19, 8, 9, 16, 71, 0, 41, 3, 35, 0, 35, 20, 19, 8, 5, 71, 40, 0, 3, 82, 4, 76, ...

Mutate

mutate takes a vector and returns one of the same dimension

  • ‘window’ function


We will contrast it with summarize later

  • or summarise if you prefer
  • Takes a vector and returns smaller one (aggregates)
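The contrast in one small sketch (a preview of summarize, which is covered later):

```r
library(dplyr)

# mutate is a 'window' function: output has as many rows as the input
centered = mtcars %>% mutate(mpgCentered = mpg - mean(mpg))
nrow(centered)                           # 32, same as mtcars

# summarize aggregates: many rows in, one row out
mtcars %>% summarize(meanMPG = mean(mpg))
```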

Generating New Data

A common task: creating composites of existing variables

bball = bball %>% 
  mutate(trueShooting = PTS / (2 * (FGA + (.44 * FTA))),
         effectiveFG = (FG + (.5 * X3P)) / FGA, 
         shootingDif = trueShooting - FG.)

summary(select(bball, shootingDif))  # select and others don't have to be piped to use
  shootingDif      
 Min.   :-0.08645  
 1st Qu.: 0.04193  
 Median : 0.08016  
 Mean   : 0.07855  
 3rd Qu.: 0.11292  
 Max.   : 0.25000  
 NA's   :2         

Generating New Data

Note how we use the new variables

We can use them without quotes

We can use any variable as soon as it’s created

Generating New Data

Sometimes we want to combine (or split) variables…

unite creates a string combination of others

  • essentially paste
library(tidyr)
bball %>% 
  unite("posTeam", Pos, Tm) %>% 
  select(1:5) %>% 
  head
  Rk        Player posTeam Age  G
1  1    Quincy Acy  PF_NYK  24 68
2  2  Jordan Adams  SG_MEM  20 30
3  3  Steven Adams   C_OKC  21 70
4  4   Jeff Adrien  PF_MIN  28 17
5  5 Arron Afflalo  SG_TOT  29 78
6  5 Arron Afflalo  SG_DEN  29 53
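Since unite is essentially paste, a base-R sketch of the same operation (shown on a small stand-in for the bball data):

```r
# base equivalent of unite("posTeam", Pos, Tm): paste with an underscore
d = data.frame(Pos = c('PF', 'SG'), Tm = c('NYK', 'MEM'))
d$posTeam = paste(d$Pos, d$Tm, sep = '_')
d$posTeam   # "PF_NYK" "SG_MEM"
```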

Generating New Data

separate does the opposite

Separate player into first and last names based on the space

bball %>% 
  separate(Player, into=c('firstName', 'lastName'), sep=' ') %>% 
  select(1:5) %>% 
  head
  Rk firstName lastName Pos Age
1  1    Quincy      Acy  PF  24
2  2    Jordan    Adams  SG  20
3  3    Steven    Adams   C  21
4  4      Jeff   Adrien  PF  28
5  5     Arron  Afflalo  SG  29
6  5     Arron  Afflalo  SG  29

Your turn

Your turn

Data: state.x77

  • 0. Convert to data.frame
  • 1. Create a variable that’s the log of Population (mutate)
  • 2. Create curLifeExp as Life Expectancy (Life.Exp) + 5 (mutate)
  • 3. Summarize the data (summary)

Example

state.x77 %>% 
  data.frame %>% 
  mutate(popLog = log(Population),
         curLifeExp = Life.Exp+5) %>% 
  summary
   Population        Income       Illiteracy       Life.Exp         Murder          HS.Grad          Frost       
 Min.   :  365   Min.   :3098   Min.   :0.500   Min.   :67.96   Min.   : 1.400   Min.   :37.80   Min.   :  0.00  
 1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625   1st Qu.:70.12   1st Qu.: 4.350   1st Qu.:48.05   1st Qu.: 66.25  
 Median : 2838   Median :4519   Median :0.950   Median :70.67   Median : 6.850   Median :53.25   Median :114.50  
 Mean   : 4246   Mean   :4436   Mean   :1.170   Mean   :70.88   Mean   : 7.378   Mean   :53.11   Mean   :104.46  
 3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575   3rd Qu.:71.89   3rd Qu.:10.675   3rd Qu.:59.15   3rd Qu.:139.75  
 Max.   :21198   Max.   :6315   Max.   :2.800   Max.   :73.60   Max.   :15.100   Max.   :67.30   Max.   :188.00  
      Area            popLog        curLifeExp   
 Min.   :  1049   Min.   :5.900   Min.   :72.96  
 1st Qu.: 36985   1st Qu.:6.984   1st Qu.:75.12  
 Median : 54277   Median :7.951   Median :75.67  
 Mean   : 70736   Mean   :7.863   Mean   :75.88  
 3rd Qu.: 81163   3rd Qu.:8.511   3rd Qu.:76.89  
 Max.   :566432   Max.   :9.962   Max.   :78.60  

Selecting Variables

Selecting Variables

Often you do not need the entire data set

Easily handled in base R (as shown earlier)

It can be more clear to use select in dplyr

  • Don’t have to create objects, use quotes or $, etc.
bball %>% 
  select(Player, Tm, Pos, MP, trueShooting, effectiveFG, PTS) %>% 
  summary
    Player               Tm                Pos                  MP        trueShooting     effectiveFG    
 Length:651         Length:651         Length:651         Min.   :   1   Min.   :0.0000   Min.   :0.0000  
 Class :character   Class :character   Class :character   1st Qu.: 272   1st Qu.:0.4719   1st Qu.:0.4402  
 Mode  :character   Mode  :character   Mode  :character   Median : 896   Median :0.5174   Median :0.4799  
                                                          Mean   :1042   Mean   :0.5060   Mean   :0.4728  
                                                          3rd Qu.:1674   3rd Qu.:0.5555   3rd Qu.:0.5180  
                                                          Max.   :2981   Max.   :1.0638   Max.   :1.0000  
                                                                         NA's   :2        NA's   :2       
      PTS        
 Min.   :   0.0  
 1st Qu.:  90.0  
 Median : 308.0  
 Mean   : 428.1  
 3rd Qu.: 658.0  
 Max.   :2217.0  
                 

Selecting Variables

What if we want to drop some variables?

  • Can’t calculate correlation on the string variables, so drop
scoringDat = bball %>%                                               # creating object as we will come back to it later
  select(Player, Tm, Pos, MP, trueShooting, effectiveFG, PTS)

scoringDat %>% 
  select(-Player, -Tm, -Pos) %>% 
  cor(use='complete') %>% 
  round(2)
               MP trueShooting effectiveFG  PTS
MP           1.00         0.34        0.30 0.93
trueShooting 0.34         1.00        0.96 0.33
effectiveFG  0.30         0.96        1.00 0.27
PTS          0.93         0.33        0.27 1.00

Selecting Variables

Sometimes, we have a lot of variables to select

If they have a common naming scheme, this can be very easy

bball %>% 
  select(Player, contains("3P"), ends_with("RB")) %>% 
  arrange(desc(TRB)) %>% 
  head
          Player X3P X3PA  X3P. ORB DRB  TRB
1 DeAndre Jordan   1    4 0.250 397 829 1226
2 Andre Drummond   0    2 0.000 437 667 1104
3      Pau Gasol  12   26 0.462 220 699  919
4 Tyson Chandler   0    0    NA 294 570  864
5 Nikola Vucevic   2    6 0.333 238 572  810
6    Rudy Gobert   0    2 0.000 265 510  775

Filtering Observations

Filtering Observations

Recall this bit of code?

bball = read_html(url) %>% 
  html_nodes("#totals_stats") %>% 
  html_table %>% 
  data.frame %>% 
  filter(Rk != "Rk")

You will notice the filter line at the end

We sometimes want to see a very specific portion of the data

Filtering Observations

  • filter returns rows with matching conditions.
  • slice allows for a numeric indexing approach.
  • Say we want to look at forwards (SF or PF) over the age of 35…
bball %>% 
  filter(Age > 35, Pos == "SF" | Pos == "PF")
  • or the first 10 rows…
bball %>% 
  slice(1:10)

Filtering Observations

This can be done with things that are created on the fly…

bball %>% 
  unite("posTeam", Pos, Tm) %>%             # create a new variable
  filter(posTeam == "PF_SAS") %>%           # use it for filtering
  arrange(desc(PTS/PF)) %>%                 # order by on-the-fly operation
  select(1:10)
   Rk         Player posTeam Age  G GS   MP  FG FGA   FG.
1 126     Boris Diaw  PF_SAS  32 81 15 1984 291 632 0.460
2  56    Matt Bonner  PF_SAS  34 72 19  935  94 230 0.409
3  26     Jeff Ayres  PF_SAS  27 51  0  383  55  95 0.579
4 187 JaMychal Green  PF_SAS  24  4  0   25   4   7 0.571

Your turn

Your turn

A brief exercise:

  • 1. filter the iris data set to only the virginica species (var == value)
  • 2. show only the Petal.Length and Petal.Width variables (select)
  • 3. bonus: redo, but instead filter if the ratio of Petal.Length to Petal.Width is greater than 5.

  • Which species do these observations belong to?

Example

iris %>% 
  filter(Petal.Length/Petal.Width > 5) %>% 
  summary
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width           Species  
 Min.   :4.300   Min.   :2.900   Min.   :1.100   Min.   :0.1000   setosa    :34  
 1st Qu.:4.800   1st Qu.:3.100   1st Qu.:1.400   1st Qu.:0.2000   versicolor: 0  
 Median :5.000   Median :3.400   Median :1.450   Median :0.2000   virginica : 0  
 Mean   :4.982   Mean   :3.382   Mean   :1.456   Mean   :0.1882                  
 3rd Qu.:5.200   3rd Qu.:3.575   3rd Qu.:1.500   3rd Qu.:0.2000                  
 Max.   :5.800   Max.   :4.200   Max.   :1.900   Max.   :0.3000                  

Reshaping Data

Reshaping Data

Sometimes you need to reshape your data

  • from wide to long
  • or vice versa

Wide to Long

Use the tidyr package

gather will take wide data and melt it into long format

  • key is the name of the new variable that will hold the original column names
  • value is the name of the new variable that will contain their values
bballLong = bball %>% 
  select(Player, Tm, FG., X3P., X2P., trueShooting, effectiveFG) %>%
  rename(fieldGoalPerc = FG., threePointPerc = X3P., twoPointPerc = X2P.) %>% 
  gather(key = 'vitalInfo', value = 'value', -Tm, -Player) 

bballLong %>% head
         Player  Tm     vitalInfo value
1    Quincy Acy NYK fieldGoalPerc 0.459
2  Jordan Adams MEM fieldGoalPerc 0.407
3  Steven Adams OKC fieldGoalPerc 0.544
4   Jeff Adrien MIN fieldGoalPerc 0.432
5 Arron Afflalo TOT fieldGoalPerc 0.424
6 Arron Afflalo DEN fieldGoalPerc 0.428

Long to wide

Going in the reverse direction, use spread

bballLong %>%
  spread(vitalInfo, value) %>% 
  head
        Player  Tm effectiveFG fieldGoalPerc threePointPerc trueShooting twoPointPerc
1   A.J. Price CLE   0.2647059         0.265          0.000    0.3002183        0.391
2   A.J. Price IND   0.5224719         0.438          0.385    0.5416839        0.480
3   A.J. Price PHO   0.2142857         0.214          0.000    0.2142857        0.429
4   A.J. Price TOT   0.4270073         0.372          0.263    0.4506641        0.450
5 Aaron Brooks CHI   0.4951040         0.421          0.387    0.5338198        0.442
6 Aaron Gordon ORL   0.4783654         0.447          0.271    0.5173735        0.500

Grouping and Summarizing Data

Grouping and Summarizing Data

A very common task is to look at group-based statistics

We can use group_by and summarize

scoringDat %>% 
  group_by(Pos) %>%                                                 # can group by more than one variable
  summarize(meanTrueShooting = mean(trueShooting, na.rm = TRUE))    # potentially any function can work
# A tibble: 11 x 2
     Pos meanTrueShooting
   <chr>            <dbl>
 1     C        0.5444475
 2    PF        0.5117584
 3 PF-SF        0.4561743
 4    PG        0.4812083
 5 PG-SG        0.5115452
 6    SF        0.5089447
 7 SF-PF        0.5537975
 8 SF-SG        0.5498324
 9    SG        0.4948718
10 SG-PG        0.5435740
11 SG-SF        0.4706034

Grouping and Summarizing Data

Use do on grouped data to go further

state.x77 %>% 
  data.frame() %>% 
  mutate(Region=state.region) %>% 
  group_by(Region) %>% 
  do(PopIncCorr=cor(.$Population, .$Income)) %>%  
  data.frame()
         Region PopIncCorr
1     Northeast  0.4216813
2         South  0.2901941
3 North Central  0.2940437
4          West  0.1684014

Grouping and Summarizing Data

Use do on grouped data to go further

library(nycflights13)
carriers = group_by(flights, carrier)
group_size(carriers)
 [1] 18460 32729   714 54635 48110 54173   685  3260   342 26397    32 58665 20536  5162 12275   601
## mods = do(carriers, mod = lm(arr_delay ~ dep_time, data = .)) # reminder that data frames are lists
## sapply(mods$mod, summary)

Your Turn

Putting it all together

?state.x77

Using one pipe sequence

  • 1. convert state.x77 (a base R object) to a data frame (data.frame)
  • 2. create a new variable called Region that is equal to state.region (mutate)
  • 3. create a new variable called State that is equal to state.name
  • 4. filter only if Population is greater than 1000 (thousands)
  • 5. select Region and variables beginning with the letter I
  • 6. group by Region (group_by)
  • 7. summarize Income, Illiteracy, or both, using the mean function (summarize)

Example

state.x77 %>% 
  data.frame %>% 
  mutate(Region = state.region,
         State = state.name) %>% 
  filter(Population > 1000) %>% 
  select(Region, starts_with('I')) %>% 
  group_by(Region) %>% 
  summarize(meanInc=mean(Income),
            meanIll=mean(Illiteracy))
# A tibble: 4 x 3
         Region meanInc  meanIll
         <fctr>   <dbl>    <dbl>
1     Northeast  4731.0 1.066667
2         South  3958.8 1.793333
3 North Central  4607.9 0.710000
4          West  4525.0 1.085714

Nothin’s gonna stop the flow

More with pipes

Recap thus far:

%>% : Passes the prior object to the function after the pipe

  • x %>% f same as f(x)

Example:

iris %>% head
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

More with pipes

Other pipes are available with magrittr and other packages

%$%: Exposes the names in the prior object to the function after the pipe

  • x %$% y(a, b) same as y(x$a, x$b)

Example (no data argument needed for lm):

iris %$% lm(Sepal.Length ~ Sepal.Width)

Call:
lm(formula = Sepal.Length ~ Sepal.Width)

Coefficients:
(Intercept)  Sepal.Width  
     6.5262      -0.2234  

More with pipes

%T>% : Passes the prior object both to the next step and on to subsequent steps

  • x %T>% y %>% z is the same as running x %>% y and x %>% z

More with pipes

Example:

iris %>% select(Sepal.Length, Sepal.Width) %T>% plot %>% summary

  Sepal.Length    Sepal.Width   
 Min.   :4.300   Min.   :2.000  
 1st Qu.:5.100   1st Qu.:2.800  
 Median :5.800   Median :3.000  
 Mean   :5.843   Mean   :3.057  
 3rd Qu.:6.400   3rd Qu.:3.300  
 Max.   :7.900   Max.   :4.400  

More with pipes

Unfortunately the T pipe does not allow for some things

  • Works:
iris %>% select(Sepal.Length, Sepal.Width) %T>% plot %>% summary
  • Provides no summary:
iris %>% select(Sepal.Length, Sepal.Width) %T>% summary %>% plot
  • Somewhat limiting in my opinion.

More with pipes

%<>% : Assigns the result of the operations that follow back to the initial object

  • Example:
x = c(1:3,9:7,10,4:6)
x %<>% .^2 %>% sort
x
 [1]   1   4   9  16  25  36  49  64  81 100

Piping for Visualization

Piping for Visualization

Piping is not limited to data management functions

Any R function can be potentially piped to

data.frame(y=rnorm(100), x=rnorm(100)) %$%
  lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
   -0.05802      0.13975  

This facilitates data exploration

htmlwidgets

Many newer interactive visualization packages take advantage of piping

htmlwidgets makes it easy to create js-based visualizations

  • i.e. what you see everywhere on the web.

Some htmlwidgets packages

leaflet

  • maps with OpenStreetMap

dygraphs

  • time series visualization

networkD3

  • Network visualization with D3

DT

  • Tabular data via DataTables

rthreejs

  • 3D graphics

leaflet

Use leaflet for quick and easy maps

leaflet() %>% 
  setView(lat=42.2655, lng=-83.7485, zoom=15) %>% 
  addTiles() %>% 
  addPopups(lat=42.2655, lng=-83.7485, 'Hi!')

dygraphs

dygraphs focuses on time-series objects

library(dygraphs)
data(UKLungDeaths)
cbind(ldeaths, mdeaths, fdeaths) %>% 
  dygraph()

networkD3

Use networkD3 for interactive networks

  • See also VisNetwork
library(networkD3)
forceNetwork(Links = netlinks, Nodes = netnodes, Source = "source",
             Target = "target", Value = "value", NodeID = "name",
             Nodesize = "size", Group = "group", opacity = 0.4, legend = T,
             colourScale = JS("d3.scale.category10()"), fontSize=24)

data tables

The DT package provides interactive data tables

library(DT)
datatable(select(bball, 1:5), rownames=F)

Plotly

Plotly is a cross-platform interactive visualization library

It also works with JavaScript, Python, MATLAB, and others

Reminder of what’s in the data:

bballLong %>% head
         Player  Tm     vitalInfo value
1    Quincy Acy NYK fieldGoalPerc 0.459
2  Jordan Adams MEM fieldGoalPerc 0.407
3  Steven Adams OKC fieldGoalPerc 0.544
4   Jeff Adrien MIN fieldGoalPerc 0.432
5 Arron Afflalo TOT fieldGoalPerc 0.424
6 Arron Afflalo DEN fieldGoalPerc 0.428

Plotly

Plotly works by starting with a base layer…

to which subsequent layers are added…

with additional options if needed.

Layers are added through this same piping system.

library(plotly)
bballLong %>% 
  group_by(Tm, vitalInfo) %>%
  summarize(avg = mean(value)) %>% 
  plot_ly(width=700) %>% 
  add_markers(x=~Tm, y=~avg, color =~vitalInfo) %>% 
  layout()   # options go here

Your turn

Your turn

Using plotly and the data set mtcars, we’ll create a grouped scatterplot without creating any new objects.

Now, starting with the mtcars data set:

  1. Make a new variable called ‘amFactor’ that is just a factor of the original am, with labels ‘auto’ and ‘manual’
    • factor(am, labels=c('auto', 'manual')) (mutate)
  2. Create your base plot (plot_ly )
  3. Make a scatterplot (add_markers ) of weight (wt) and miles per gallon (mpg)
    • color = amFactor

Example

prediction = predict(lm(mpg~wt, data=mtcars))

library(plotly)

mtcars %>% 
  mutate(amFactor = factor(am, labels=c('auto', 'manual'))) %>% 
  plot_ly(width=300, height=200) %>% 
  add_markers(~wt, ~mpg, color=~amFactor, alpha=.5, hover=c(~wt, ~mpg)) %>% 
  add_lines(x=~wt, y=prediction, type=2, name='prediction')

Example

Wrap up

Note that much of the functionality you see is in base R

  • with, within, tapply etc.

Now you have more flexible approaches

Think of these packages as exploratory tools

  • Use them to bring clarity to code

  • Use them to explore your data more easily for visualization and modeling

The more you use these tools, the easier they get, and the more you can do

Further Resources

Data wrangling cheatsheet

magrittr

plotly

Thanks!

Michael Clark
Consulting for Statistics, Computing & Analytics Research
Advanced Research Computing
University of Michigan

With notable assistance from:

Seth Berry
Mendoza College of Business
University of Notre Dame